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Abstract 

Thif  paj^r  $hows  how  computatitm-intemive  engineering  prob- 
lemt  can  be  computed  efficientiy  on  a  mataively  parallel  com- 
puter. Designed  mth  tens  of  thousands  of  processing  elements, 
these  machines  now  offer  $ubsUmiially  improved  eompuUtion 
times  and  improved  cost  performance,  and  in  the  future  promise 
to  rapidly  reach  higher  performance  leveU.  We  wUl  discuss 
the  ease  of  developing  data  pardlel  dgorithms  for  VLSI  circuit 
fdacement,  VLSI  circuit  simtdation  and  seitmic  migration. 


1      Jntroduction 

Massively  parallel  compnters  with  tens  of  thousands  of  process- 
ing elements  which  can  operate  in  parallel  are  entering  the  mar- 
ket place.  Because  of  the  regiilBrity  and  simplicity  of  the  design, 
these  machines  offer  a  substantial  cMt  performance  improve- 
ment over  conventional  machines.  Purthennore,  these  machines 
promise  to  reach  substantially  higher  absolute  performance  lev- 
els in  the  coming  years  than  conventional  supercomputers. 

Key  questions  are  often  raised:  Are  computation-intensive  ap- 
pUcations  paraUehaable  such  that  these  machines  can  be  used 
efficiently?  Can  th^e  machines  be  programmed  with  reason- 
able effort? 

This  paper  will  address  th^e  qu^tions  by  discussmg  in  some 
detail  three  examples  of  computation  intensive  algorithnB  from 
the  engineering  domain,  namely  VLSI  circuit  placement,  VXSI 
circuit  simulation  and  seismic  migration.  A  goal  is  to  provide 
some  basic  insight  into  the  proems  of  designing  algorithms  for 
a  massively  parallel  computer  from  which  the  reader  can  ex- 
trapolate how  other  appUcations  can  be  computed  on  ruch  a 
machine. 

Virtually  aU  computation-intensive  algorithms  deal  with  large 
amounts  of  data.    VLSI  design  deals  with  the  placement  of 


thousands  of  circuit  elements  or  with  the  simulation  of  net- 
works of  hundreds  of  thousands  of  transistors.  Seismic  migra- 
tion computes  millions  of  gridpoints  of  »  subsurface  image,  etc. 
Massively  parallel  computing  t^o  advantage  of  the  fact  that 
in  those  applications,  thousands  of  daU  objects  can  be  oper- 
ated on  in  paraUeL 

Essential  to  the  ease  of  parallel  programming  is  a  simple  model 
for  parallel  computation  which  is  efficiently  implemented  by  the 
underlying  hardware.  We  wUl  use  the  Connection  Machine® 
system^  as  an  example  to  briefly  summarize  this  modeL  A  pro- 
grammer stKts  with  familiar  concepts  for  structuring  complex 
data  objects  such  as  graphs  for  representing  drcmt  schematics 
or  arrays  for  representing  images.  The  new  concept  is  that  the 
variables  which  represent  the  vertic^  of  a  graph  or  the  elements 
of  an  array  can  be  declm-ed  to  be  parallel  variables.  The  other 
essential  concepts  are  parallel  operations  which  are  performed 
on  aU  parallel  variabte  simultaneously  and  an  operation  for  se- 
lecting a  subset  of  parallel  variables  in  order  to  restrict  paraDel 
operations. 

The  Connection  Machine  which  supports  this  model  can  be 
viewed  as  an  extension  of  the  memory  of  a  conventional  front 
end,  e.g.,  a  VAX,  where  partitions  of  the  memory  have  theL' 
own  execution  units  which  execute  parallel  operations.  In  addi- 
tion, the  Connection  Machine  system  has  a  very  sophisticated 
commimications  network  which  permits  aU  pwallel  execution 
\mits  to  access  individual  memory  locations  in  the  entire  ma- 
chine in  parallel. 

PM-aUel  variables  are  stored  in  individual  processing  elements, 
i.e.,  processor  memory  pmn.  Parallel  operations  are  executed 
by  the  parallel  execution  units.  The  communication  system 
is  involved  if  compound  operations  are  performed  on  parallel 
variables  which  involve  access  to  other  variables;  via  pointers, 
for  instance. 


®  Connection  Maclmie  u  s  regijtered  tiEdenmrk  of  TWnkini  Machines 
Corpotation 


The  rtmModer  of  thu  paptr  dweribei  Ui«  Mtmce  of  pafdld  aJ- 
gorithaoi  for  thxtt  ®apn®«iB|  »ppMc»tions,  Stction  2  dticribo 
VLSI  PlKemeat,  Seetisn  3  deiesibti  VLSI  Ckeuit  Simuktion, 
and  Seetioa  4  duk  with  teismie  migratioo. 

2     VLSI  Placemeni 

Smuemtom  VLSI  circiuts  with  more  th&a  10,000  etUs  or  parts 
hmvt  btcomt  quite  commoa.  Placing  this  aumber  of  pMti  oa  & 
eMp  i«eh  that  tht  asm  takm  up  by  intereoantct  msu  ii  min- 
imiied  is  a  difficult  uid  tim@  eouumiog  problem. 

Many  algorithm*  we  in  m@  which  tmckle  thu  problem.  Often 
these  alf orithnu  handle  ipedfic  placement  problems  quite  weU 
but  cannot  be  easily  adjusted  to  varying  condition!  due,  for  in- 
stance, to  tedmolosf  change*.  A  method  which  seems  to  yield 
good  nptimiyatinn  reiulti  on  a  broad  spectrum  of  placement 
problem!  is  the  limulated  annealing  method^.  Unfortunately 
gimul*ted  aaneoling  is  '^ery  computation  intensive  with  com- 
putation times  of  10  to  36  hours  on  a  VAX/780  bmg  reported 
for  dretuts  with  1500  parU.* 

Reeeli  that  the  innerloop  of  the  simtikted  MmeaHng  algorithm 
coasigts  of  generatiag  a  ntw  state  by  moving  a  paxt  or  by 
swapping  parts,  c^ctikting  the  change  in  cost,  Mid  calculating 
wheth^  to  accept  or  reject  the  move.  The  intuitive  idea  for 
par^eUaing  this  algorithm  is  to  p«form  many  of  these  steps 
in  pKaUd,  i.e.,  move  or  swap  many  parts  in  pamUd  ,  ev^uate 
the  cost  chiuige  for  each  of  th«e  moves  in  pamUel  and  caleu- 


kt®  whether  to  accept  or  rtject  t»ch  of  the  mov©»  in  fmmBd. 
Unfortunately,  such  par  Jld  movw  we  in  gener  J  not  indepoi- 
dent.  For  imtaace  if  two  pai«  swap  their  p^itiom  md  if  thmc 
pain  share  a  common  net,  then  the  change  in  wirelength  for 
one  move  is  generally  dependent  on  whether  the  other  pair  ae- 
tuaUy  swaps  or  not. 

The  Miential  data  objects  used  during  plaeemtnt  optimigatioa 
we  the  netUst  of  the  circuit  and  the  slots  on  the  surface  of  the 
cMp  onto  which  parts  cm  be  pkced.  We  assuaia  a  strndard 
cdl  technology  wher®  lU  P«t8  have  identical  heights,  but  var- 
ious widths.  The  netUst  i»  rcprwenttd  by  wsignini  each  part 
Mid  each  net  to  a  separate  proceiior,  that  is  the  data  struetujc 
which  represents  a  part  or  a  net  is  stored  in  the  memory  of  a 
separate  proctasor.  Conventional  pointers  are  used  to  reproeat 
the  Uttks  between  parts  and  nodes.  The  chip  surface  is  repre- 
sented by  storing  each  slot  m  a  separate  proc^sing  element. 
Pointers  between  pMts  and  slots  represent  the  placement  of 
those  parts  in  available  slots.  Figure  1  shows  an  eximiple  of 
how  a  circuit  to  be  placed  mxd  a  chip  surface  is  mapped  onto 
processing  elements. 
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Ftgurs    1:    BtpraMntalion  of  Circuit  and  Slots  on  Surtaea  of  Chip  on  ProeMslne 
(SlisO#d  srus     reprMMI  proMsaing  sisirants) 


The  rtmainder  of  thu  paper  deicrib^  the  ^lence  of  parallel  J- 
gorithms  for  thre«  enpaeering  applications,  Section  2  dwcribei 
VLSI  Pkcement,  Section  3  describei  VLSI  Circuit  SLmuktioa, 
and  Section  4  deals  with  teiimic  migration. 

2     VTiSI  Placement 

Semicustom  VLSI  circuits  with  more  than  10,000  cells  or  parts 
have  become  quite  common.  Pl&cing  this  number  of  parts  on  a 
chip  lueh  that  the  area  t^en  up  by  intereoniieet  wira  i>  min- 
imised is  a  difficult  uid  time  eonsmmng  problem. 

MMiy  idgorithms  are  in  uie  which  tackle  this  problem.  Often 
these  algorithms  handle  specific  placement  problems  quite  weU 
but  cannot  be  easily  adjusted  to  varying  conditions  due,  for  in- 
stsmce,  to  technolo^  change.  A  method  which  seems  to  yield 
good  optinmation  results  on  a  broad  spectrum  of  placement 
problems  is  the  simulated  annealing  method^.  Unfortunately 
simulated  axme^ing  is  very  computation  intensive  with  com- 
putation tunes  of  10  to  36  hours  on  a  VAX/780  being  reported 
for  circuits  with  1500  parte.* 

Recall  that  the  imierloop  of  the  simulated  umeaUng  algorithm 
consists  of  generating  a  new  state  by  moving  a  part  or  by 
swapping  parts,  calculating  the  change  in  cost,  and  calculating 
whether  to  accept  or  reject  the  move.  The  intuitive  idea  for 
paxaUeliziBg  this  algorithm  is  to  perform  many  of  these  steps 
in  pardlel,  i.e.,  move  or  swap  many  parts  in  parxdlel ,  ev^uate 
the  cost  change  for  each  of  these  moves  in  juraUel  and  calcu- 


late whether  to  accept  or  reject  mth  of  the  moves  in  ^raUd. 
Unfortimately,  such  parcel  mov®  are  in  general  not  indepen- 
dent. For  inst«ice  if  two  pairs  swap  their  p<»ition«  and  if  thme 
pairs  share  a  common  net,  then  the  change  in  wirelength  for 
one  move  is  generally  dependent  on  whether  the  other  pair  ac- 
tually swaps  or  not. 

The  essential  data  objects  used  during  placement  optimiiation 
are  the  netUit  of  the  circuit  and  the  slots  on  the  surface  of  the 
chip  onto  which  pwts  can  be  placed.  We  Msume  a  standard 
ceU  technology  where  Jl  parts  have  identic^  heights,  but  vm- 
ious  widths.  The  aetUst  it  repr«eated  by  assigning  each  part 
and  each  net  to  a  separate  processor,  that  is  the  data  structure 
which  represents  a  part  or  a  net  is  stored  in  the  memory  of  a 
separate  processor.  Conventional  pointers  are  used  to  represent 
the  Unks  between  parts  and  nod^.  The  chip  surface  is  repre- 
sented by  storing  each  slot  in  a  separate  processing  element. 
Pointers  between  parts  and  slots  represent  the  placement  of 
those  parts  in  avMlable  slots.  Figure  1  shows  an  example  of 
how  a  circuit  to  be  placed  and  a  chip  surface  is  mapped  onto 
processing  elements. 
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Ftgurs    1:   nsprsMnlalion  of  Circuit  and  Slots  on  Suilaes  of  Chip  on  ProcMsln^ 
(Sha«#d  areas     rsprsuni  procasaing  e<sm#nta) 
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Abstract 

Thu  paper  thow$  how  computation-intenaive  engineenng  prob- 
lema  can  be  computed  effietenUy  on  a  maM$ively  pandkl  com- 
puter. Designed  with  tens  of  ihouiandt  of  proeeasing  elements, 
ihete  machines  now  offer  tnbMtantidly  improved  computation 
time$  and  improved  &>ai  performance,  and  in  the  future  pnqmwe 
to  rapidly  reach  higher  p^ormance  leveU.  We  wai  discuss 
the  ease  of  developing  data  paraUel  dgorithnu  for  VLSI  drcuii 
placement,  VLSI  etreuit  simulation  and  seimic  migration. 


1      T^trnciuetion 

MMsiTely  pwaUel  eompnt«s  with  tenj  of  thousands  of  process- 
us dements  wMeh  c«n  op«ate  m  pisaUd  aie  eateriag  the  mar- 
ket pkce.  Beeatue  of  the  regukrity  and  simpUdty  of  th«  dwi^, 
thwe  mAchmm  offe  *  snbgtmtial  c«t  perfonnaact  tepjoipe- 
meat  over  eonv^iond  macMnM.  Fnxtheraore,  these  masMaes 
promue  to  reach  substMtkUy  Mghef  absolute  perfofmAnee  lev- 
ek  ja  the  coming  years  than  convention^  lupercomputeis. 

Key  questions  are  often  rmsed;  Aje  eomputatioa-intensi^e  ap- 
plications par^ehxable  such  that  thtae  madunes  can  be  used 
effidently?  Can  th^e  machines  be  programmed  with  reason- 
able effort? 

This  paper  wiU  addr^s  thwe  qu^tions  by  discussing  in  some 
detail  three  examplts  of  computation  intensive  algorithm  from 
the  engineering  domdn,  namdy  VLSI  circuit  pkcement,  VLSI 
ctfcuit  simolation  md  sdimic  joigration.  A  god  is  to  provide 
gome  bmsie  insight  into  the  proe«8  of  desiring  dgorithms  for 
a  mMsivdy  parallel  computer  from  which  the  reader  can  ex- 
trapolate how  other  applications  can  be  computed  on  iruch  & 
midline. 

Virtually  aU  comput*tion-int€niive  dgorithms  ded  with  large 
amounts  of  data.    VLSI  design  deds  with  the  placement  of 


thousands  of  drcuit  dements  or  with  the  simulation  of  net- 
works of  htmdreds  of  thousands  of  transistors.  Sdsmic  migra- 
tion computes  milliom  of  gridpomts  of  a  subsurface  image,  etc 
Messivdy  parallel  computia|  tekn  advantage  of  the  fact  that 
in  those  applications,  thou*mds  of  data  objects  can  be  oper- 
ated on  in  paraUd. 

Essentid  to  the  ease  of  psrafld  propamming  is  a  simple  modd 
for  paraUd  computation  which  is  ardently  implemented  by  the 
underlying  hardware.  We  wOl  uie  the  Connection  MacMae* 
system^  as  an  example  to  briefly  smnmMize  this  model  A  pro- 
grammer starts  with  famfliar  concepts  for  structuring  complex 
data  objects  such  as  graphs  for  repr«enting  circuit  schematics 
or  arrays  for  representing  images.  The  new  concept  is  that  the 
variable  which  reproent  the  vertices  of  a  graph  or  the  dements 
of  an  KTsy  can  be  dedMed  to  be  paraUd  variables.  The  other 
essentid  concepts  are  paraUd  operations  which  are  perfonned 
OB  aH  pasdld  variable  limultiseensly  and  an  operation  for  st- 
lectiag  ft  subset  of  pardld  VKiabl«  in  ordo'  to  restrict  pwaDd 
operations. 

The  Connection  Machme  winch  supports  this  modd  can  be 
viewed  as  an  extension  of  the  memory  of  a  eonventiond  front 
end,  e.g.,  a  VAX,  where  partitions  of  the  memory  have  thdr 
own  execution  units  which  execute  paraUd  operations,  hx  addi- 
tion, the  Conaeetion  Maduat  system  has  a  very  sophisticated 
eommimications  network  wMeh  psmits  dl  paraUd  execution 
units  to  aecMS  individnd  memory  locations  in  the  entire  ma- 
chine in  paraUeL 

Pardlel  variables  are  stored  ia  individtid  processini  dements, 
Lfc,  processor  memory  pairs.  PwaUd  opeations  are  executed 
by  the  paraUd  execution  units.  The  con^uaication  system 
is  involved  if  compound  operations  Ke  performed  on  paiaUd 
variables  which  involve  access  to  other  vmables;  via  pointers, 
for  instance. 


®  Coaaeetion  Mmchme  m  &  regiateyed  timdomik  of  Tluaksig  Madiiaes 
CorporaUon 


With  thk  rtpreieatation  of  ©bjtcts  ia  proteasing  elements,  the 
Jgorithm  preceedi  u  foUows: 

Select  randomly  »  lubiet  (®.g.,  40%)  of  parti 
/"  »election  of  new  stmt©  •/ 

P«rts  (selected)  randomly  gdeet  a  new  ilot  to  move  towards 
/*Paxts  swap  pcMitionj  with  parts  found  in  the  leleeted  slot*/ 
/*the  implementation  guarukte^  that  parts  found  in  new  slots 
are  disjoint  from  the  originaQy  selected  parts^/ 
/*  e^eulation  of  change  in  wirelength  (it  done  individually  for 
®»eh  »w»ppinf  pair)  */ 
Pwti  send  their  new  poiition  to  nodes 
Nodes  calculate  new  wireloigth  and  return  results  to  parts 
Pvtt  calculate  change  in  wireloigth  for  aU  nodes  they  are  con- 
nected to 

Parts  calculate  change  in  wirelength  and  associated  cost  for 
swapping  pairs 

/■•  calculation  of  rowlength  change  */ 

Parts  compute  change  of  rowlength  and  associated  cost  for  each 
swapping  pair 

Pwts  compute  final  cc«t  for  swaps 
/""  accept  or  reject  */ 

Parts  compute  whether  to  accept  or  reject  swap 
Moves  are  fmaUsed  by  updating  data  in  parts  and  slots 

With  the  pudld  implementation  of  the  annealing-based  al- 
gorithm we  have  tried  to  stay  close  to  the  first  two  stages  of 
the  TimberWolf  ^gorithm^.  Instead  of  trying  to  only  permit 
independent  moves,  we  have  implemented  an  algorithm  which 
tolerates  some  error  in  the  cost  calculation.  Errors  are  due  to 
the  iatrference  of  paralldi  swaps  (e.g.,  two  swapping  pairs  may 
share  a  eon[mon  net).  Con^ared  to  an  algorithm  with  only  in- 
dependent swaps  which  was  outlined  for  a  ginq>le  version  of  an 
umeaUng-bued  algorithm  in  Reference  4,  the  above  ^gorithm 
uses  lets  time  p^  iteration  and  permits  more  pardlel  moves 
per  ito'ation.  During  the  initial  stage  of  the  annealing  pro- 
cess approsmately  757o  of  the  parts  participate  in  attempted 
mov^.  Aj  the  temperature  decreases  the  ^gorithm  decreases 
this  percentage  as  a  global  means  to  reduce  the  oror.  At  the 
same  time  the  numW  of  iterations  per  temperature  step  is 
increased.  Specific  pwametert  for  these  functions  were  deter- 
mined e^erimentidly. 

The  optimization  results  of  this  algorithm  are  practically  the 
same  as  for  TimberWoH  The  eseeution  time  of  this  algorithm 
for  a  circuit  with  more  than  6,000  parts  is  appromnately  2 
houiB  on  a  Connection  Machine  con%aration  with  32K  proces- 
sors. On  a  fuU  64K  procasor  conAguration  a  circuit  with  ap- 
proximately 15,000  parts  (40K  gate  equivalents)  can  be  placed 
in  2  hours. 


3     VLSI  Circuit  Simulation 

Today's  VLSI  circuits  commonly  have  tens  to  hundreds  of  thou- 
sands of  transistors  on  a  single  chip.  For  accurate  verification 
of  the  timing  behavior,  circtuts  are  simulated  at  the  level  of 
an^og  waveforms.  This  simnlation  is  so  computation-intensive 
thmt  typically  only  drcuits  with  50  to  several  hundred  transis- 
ton  we  gimukted  at  •  tiuM. 


The  circuit  simulation  problem  k  formulated  m  a  system  of 
nonlinear  diff'erential  equations^.  A  sparse  mAtrix  whose  sise 
depends  on  the  circuit  siie  needs  to  be  solved  in  the  innerloop 
of  the  algorithm  which  soIvm  this  problem. 

In  this  paper  we  discuss  a  relaxation-based  approach®'^'*'"  to 
solving  thu  problem,  specifically  a  Gauss-Jacobi  rekzation  at 
the  nonlinear  equation  level*".  This  iterative  method  permits 
parallel  e^cuktion  of  aU  unknown  voltages  of  a  circuit  in  the 
innerloop  of  the  algorithm. 

The  circuit  to  be  simulated  is  represented  in  the  machine  by 
storing  each  device  «id  each  node  in  a  separate  processing  ele- 
ment (Fig.  2).  Pointers  or  reference  represent  the  connectivity 
between  devices  and  nodes. 


Figure  2:  RepresentBtlon  of  Circuit  on  Processing  Elements 
(Shaded  areas  represent  processing  elements) 


With  this  representation  of  the  circuit,  the  algorithm  proceeds 
as  foUows: 

For  eadi  time  point  the  following  steps  are  iterated  until  con- 
vergence is  reached 

aU  nodes  send  their  voltages  to  the  device  terminals  (Fig.  3) 
all  devices  compute  their  new  parameters  (Fig.  4) 
all  devices  send  their  parameter!  to  the  nod«  (Fig.  5) 
aU  nodes  compute  their  new  voltages  (Fig.  6) 

Circuits  with  more  than  10,000  devices  can  be  simulated  in  a 
few  minutes  on  a  Connection  Machine  system  with  64K  proces- 
sors. The  performance  of  the  inq>lementation  can  be  improved 
by  storing  nodes  and  parts  in  the  same  physical  processor  since 
only  one  type  of  object  is  computing  at  any  point  in  time. 
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Figure  3:  Nodes  send  Uoltages  to  Oeuices 


Figure  6:  Nod@s  Compute  n@u>  Uoltages 


Figure  4:  Dtuices  Compute  nam  Parameters 


^[ 


4     Seismic  Migration 

For  oil  and  gu  exploration  geophyiiditi  tu@  teismie  meth- 
odi  for  acquiring  sabsiufKe  itraetiual  and  stratigrapMe  m- 
foinmtion  withoat  drilUng  weUs.  Fropagatinf  d^tie  mmm  are 
recorded  at  the  surface  of  the  ewth  and  gtibieqnently  procnstd 
in  order  to  obtain  aa  image  of  the  gubstu&ee.  Seismie  xaigra- 
tion  h  ft  space  and  time  variant  filtering  process  which  maps 
observed  space-time  amplitude  data  into  depth  with  correct 
amplitude  at  true  spatial  pmitions.^' 

The  method  for  subsuzfaee  imaging  disewied  in  tMs  paper  is 
a  revewe  time  smgfstion  method*^;  spedficaUy  we  deal  with 
the  aeotistie  ease  in  which  wave  propagation  is  governed  bf  the 
acoiutic  wave  equati^  The  algorithm  is  .bued  on  a  boimdar^- 
value  solution  of  the  2-D  acoustic  wave  equation 
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where  U  b  the  acoustic  wave  field.  A  fmite  difference  scheme 
is  used  to  discretize  the  eontinuouj  equation.  The  difference 
operator  chosen  is  fourth  order  in  both  space  dimeniions  and 
ii  BMond  order  in  time.  The  notation 

is  used  in  writing  the  difference  operator  as  foUows: 
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Figure  S:  0@uic@s  send  Parameters  to  Nodes 


In  this  equation  v  is  the  velocity. 
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Figure  3:  Nodes  send  Uoltages  to  Oeuices 


Figure  6:  Nodes  Compute  new  Uoltages 


Figure  4:  0@uices  Compute  new  P@rameters 
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4     Seismic  Migration 

For  oil  aad  gas  exploration  geophyuciats  use  seismic  meth- 
ods for  acqui^riBg  subsuxface  itraetur^  and  stratigraphic  m- 
form&tion  without  drilling  weUs.  Propagating  elastic  waves  we 
recorded  at  the  surface  of  the  earth  and  subsequently  processed 
in  order  to  obtain  an  image  of  the  subsurface.  Seismic  migra- 
tion is  a  space  and  time  variant  filtering  process  which  maps 
observed  space-time  anapUtude  data  into  depth  with  correct 
amplitude  at  true  spatial  pcwitions.^' 

The  method  for  subsurface  imaging  disemsed  in  this  paper  is 
a  reverse  time  miration  method**;  specifically  we  ded  with 
the  acoustic  case  in  which  wave  propagation  is  governed  by  the 
acoustic  wave  equation.  The  algorithm  is  based  on  a  boimdEiry 
value  solution  of  the  2-D  acoustic  wave  equation 
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where  U  is  the  acoustic  wave  field.  A  finite  difference  schem.e 
is  used  to  discretize  the  continuous  equation.  The  difference 
operator  chosen  is  fourth  order  in  both  space  dimensions  and 
ii  second  order  is  time.  The  notation 

is  used  in  writing  the  difference  operator  as  follows: 
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Figure  S:  Oeuices  send  Parameters  to  Nodes 


In  this  equation  v  is  the  velocity. 


WitE  thk  rtpressatation  of  objeeti  in  proceMmf  eltmioti,  the 
algorithm  prectedi  m  fdlowt: 

Select  rando^r  •  lobstt  (e.g.,  40%)  of  pwti 

/•*  gdtction  of  new  »tmti  •/ 

Putt  (tdeeted)  r^d^nly  sdeet  a  new  dot  to  move  tow&rdt 

/'Parts  awap  pMitiom  with  puti  foxmd  b  the  seleettd  slot^/ 

/*th@  implemtQtation  guwuiteo  that  parts  found  m  new  slots 

M@  dujoint  fmm  the  origiaaDy  idected  parts*/ 

/*  ealeulatioa  of  ehaage  is  wktlength  (a  done  individually  for 

Msh  ■w*pping  pair)  •/ 

Puts  md  thek  b@w  pMitioa  to  nodes 

Nod«  caleukte  new  wirelragth  md  return  rwults  to  pwts 

Puts  ealeulAti  eh^ge  in  wireleaglh  for  aU  nodes  they  are  eon- 

nested  to 

Pmrts  ealeulmte  ehange  in  wirelength  and  associated  cost  for 

swapping  paLn 

/«  calculation  of  rowlength  change  */ 

Parts  compute  change  of  rowlength  and  associated  ec^t  for  each 

swapping  pur 

Parti  compute  final  cost  for  swaps 

/*  accept  or  reject  */ 

Parts  compute  whether  to  »eeept  or  reject  swap 

MoTts  are  findised  by  updatmg  data  in  parti  and  slots 

With  the  parallel  i^tlesMntation  of  the  annealing-baaed,  al- 
gorithm we  have  tried  to  stay  dose  to  the  first  two  stages  of 
the  TimbwWolf  algorithm^.  lostead  of  trying  to  only  permit 
independent  mova,  we  have  implecMnted  ui  algorithm  which 
toltratet  some  error  m  the  cMt  ealeala.tioQ.  Errors  are  due  to 
the  iatofiraiee  of  pwaflel  swaps  (e.g.,  two  swapping  pairs  may 
shwe  a  common  net).  Gcn^M^  to  an  algorithm  with  only  in- 
dep^dent  swaps  which  was  outlined  for  a  simple  version  of  an 
aane^ing-bMed  algorithm  in  Baferenee  4,  the  above  algorithm 
ns^  l^s  tmie  p@  itCTation  mnd  permitg  more  parallel  moves 
per  iteration.  During  the  mitial  stage  of  the  aaneaUng  pro- 
cos  approimatdy  7S%  of  ttt  puts  partidpate  la  attempted 
mov«.  Am  the  tsap^atw®  decrea»M  the  algorithm  deereaa« 
tlus  pere^tage  m  a  global  m@«ii  to  reduce  the  oror.  At  the 
same  time  the  numba  of  ito'atiou  per  temperature  step  is 
increased.  Spedfie  puameten  for  these  functions  were  deter- 
mined e^crjmoit^y. 

The  optimisation  raults  of  thk  algorithm  axe  prmetieaUy  the 
same  as  for  TimberWoH  The  extention  time  of  this  algorithm 
for  a  circuit  with  more  th«a  6,000  pKts  is  appromoately  2 
hours  on  a  Cosneetion  Machine  ccafiguzation  with  32K  proces- 
sors. On  a  full  64K  pr©e«i0f  configuration  a  circuit  with  ap- 
promnatdy  15,000  parts  (40K  gate  equivalents)  can  be  placed 
in  2  hours. 


3     VLSI  Circtiit  Simulation 

Today's  VLSI  cireuits  commonly  have  tens  to  hundreds  of  thou- 
sands of  transistors  on  a  single  dup.  For  accurate  verification 
of  the  timing  behavior,  circuits  are  simulated  at  the  level  of 
analog  waveforms.  This  simnlAtion  is  so  computation-intensive 
that  typicaOy  only  drcuits  with  SO  to  several  hundred  transis- 
ton  ue  simulated  at  a  tisM. 


The  cireuit  simulation  problem  h  formiUattd  u  a  system  of 
nonUnear  differential  equstiou^  A  spane  matrix  whwe  sise 
depends  on  the  circuit  siie  aeedi  to  be  solved  in  the  innerloop 
of  the  idgorithm  which  solva  thk  problem. 

In  this  paper  we  dkcuss  a  rdaxation-bMed  approach^*'^"*''  te 
solving  this  problem,  ipedficmUy  a  Gauss-Jacobi  relaxation  at 
the  nonUnew  equation  levd***.  This  iterative  method  permits 
par^d  e^culAtion  of  aU  unknown  voltage  of  a  cireuit  in  the 
innerloop  of  the  algorithm. 

The  usemt  to  be  simulated  b  represented  in  the  machine  by 
storing  each  device  uid  each  node  m  a  separate  procaiing  ele- 
m«it  (Fig.  2).  Pointeri  or  referencM  repr^ent  the  connectivity 
between  devices  and  nodes. 


Figure  2:  Bepresentstlon  of  Cireuit  on  Processing  Elements 
(Shaded  areas  represent  processing  elements) 


With  this  representation  of  the  dreuit,  the  algorithm  proceeds 
as  follows: 

For  each  Ume  point  the  following  steps  are  iterated  untO  con- 
vergence is  reached 

all  nodes  tend  thdr  voltagM  to  the  device  termin&k  (Fig.  3) 
aU  devices  compute  their  sew  pwameten  (Fig.  4) 
^  devicM  send  their  pwameters  to  the  noda  (Fig.  5) 
aB.  nodes  compute  their  new  voltagM  (Fig.  6) 

Circuits  with  more  than  10,000  devices  can  be  sunukted  is  a 
few  minute  on  a  Connection  Machine  system  with  64K  proca- 
sors.  The  performance  of  the  impienKntation  can  be  improved 
by  storing  nodes  and  parts  in  the  same  physical  processor  since 
only  one  type  of  object  is  computing  at  any  point  in  time. 


Figure  7  iUutratw  the  compatmtioa  for  thk  method.  For  weh 
time  step  the  depth  section  needi  to  be  computed.  This  com- 
putation involves  reference  to  the  depth  leetion  of  the  two 
previouj  time  steps,  the  row  of  the  source  data  for  the  corre- 
iponding  time  step  Mid  the  velc^ity  model.  Figure  8  shows  the 
computation  of  individual  grid  points  of  the  depth  section.  The 
Fig\u'e  shows  a  limpMed  computation  of  a.  second  order  finite 
difference  scheme,  whereas  the  actual  implementation  is  fourth 
order.  As  one  ewi  tee,  ^  pid  points  of  the  depth  section  for 
a  pven  time  point  can  be  computed  in  parallel.  The  following 
summarises  a  data  parallel  algorithm.  A  more  detailed  descrip- 
tion can  be  found  in  Reference  13. 
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Figur@  8:  Computation  of  Seeond  Order  Oltfertnc@  Operator 


Figure  7:  ReverM  Time  Migralion 


A  processing  element  is  assigned  to  each  grid  point  in  the  depth 
section.  The  processing  element  holds  also  the  data  for  this 
grid  point  which  corresponds  to  the  depth  sections  of  the  pre- 
vious time  points  and  of  the  velocity  model  The  source  data 
is  stored  by  distributing  it  over  the  procnsing  elements  of  the 
depth  section.  Fig.  9  stimmarizes  the  mapping  of  data  objects 
to  processing  elements  in  this  problemu 
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Figure     B:  RapreMnUtlon  ol  Daplh  S#ellon,  Velodly 
Soures  DaU  en  ProMuIng  Elarmnts 
(SnacMQ  arsas  ra|»asani  prcffia&sms  MrT^nu) 
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The  foUowmg  eomputstion  t»kM  pl^e  for  tach  iim®  »ttp.  The 
louree  data  uf&y  b  tMft@d  up  ont  it@p.  The  sow  which  is 
shifted  out  «t  the  top  ii  plK@d  onto  the  top  of  the  depth  t^- 
tio&.  AU  proeMsiag  elemtats  compute  a  depth  point  in  pwal- 
lei  Kceasing  their  aew^t  and  second  nearest  neighbors  in  two 
space  dimeniions  and  in  the  tii^  dimension. 

The  actual  implementation  of  tUi  algorithm  utilize  more  than 
75%  of  the  peak  perfonaanee  of  the  Connection  Mftchine. 

5     Conclusioji 

We  have  discussed  the  development  of  ^gorithnyi  for  a  mv- 
sivdy  pwaUd  computer.  Problani  such  as  circuit  pkeement, 
drcmt  simulation  and  seiimie  mgration  have  significantly  more 
Inherent  pwdUeUsm  than  conventional  computers  or  even  vec- 
tor maehinei  can  utilise.  Hence,  a  maiiively  parallel  machine 
c^  provide  subitmnti^  speedups  for  those  problems.  This 
speedup  also  mako  possible  appUeations  which  have  been  pro- 
lubitive  so  far.  Much  larger  problems  can  be  solved,  for  instance 
ihe  simulataon  of  complete  chips  at  the  analog  leveL 

The  data  paraUel  progrMsming  style  is  effectively  supported 
by  mMsively  parallel  machine  and  makes  program  develop- 
ment easy.  Associating  a  data  object  with  a  processing  element 
makes  it  straightformrd  to  dMi^pose  a  task  into  subtasks 
which  can  be  executed  in  paraUeL  A  key  step  in  developing 
data  pwaUtl  df  orithms  is  to  define  appropriate  data  structura 
and  an  appropriate  mapping  of  th«@  structure  onto  processing 
elements.  Smee  the  data  obJMts  which  are  computed  in  par- 
allel are  g^v^y  of  the  same  type,  a  single  program  eontrd 
stream  it  suffieieDt,  Le.,  no  i^ehrcoisation  problems  between 
coneunent  processes  need  to  be  solved  by  the  programmer. 
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The  foUowing  eomput»tion  t»kM  plm@  for  taeh  time  itep.  The 
louree  data  array  u  shifted  up  oae  itep.  The  row  which  is 
shifted  out  at  the  top  u  placed  o&to  the  top  of  the  depth  lee- 
tion.  M  proe^sing  elementi  compute  a  depth  point  in  par^- 
lel  accessing  their  aear^t  Mid  second  nearest  neighbors  in  two 
space  dimensions  and  in  the  time  dimension. 

The  actual  implementation  of  this  ^gorithm  utilises  more  than 
75%  of  the  peak  performance  of  the  Connection  Machine. 

5     Conclusion 

We  have  discussed  the  development  of  dgorithms  for  a  mas- 
sively puaUel  computer.  Problems  such  as  circuit  placement, 
circuit  simulation  uid  seismic  migration  have  significantly  more 
inherent  parallelism  than  conventional  computers  or  even  vec- 
tor machines  can  utilize.  Hence,  a  massively  parallel  machine 
can  provide  lubstuitial  speedups  for  those  problems.  This 
speedup  also  makes  possible  applications  which  have  been  pro- 
lubitive  so  far.  Much  larger  problems  can  be  solved,  for  instance 
Ihe  simulation  of  complete  chips  at  the  analog  level. 

The  data  parallel  programming  style  is  effectively  supported 
by  massively  p&raUel  machines  and  makes  program  develop- 
ment easy.  Associating  a  data  object  with  a  processing  element 
makes  it  straightforward  to  decompose  a  task  into  subtasks 
which  can  be  executed  in  p&raUeL  A  key  step  in  developing 
data  parallel  algorithms  is  to  define  appropriate  data  structure 
and  an  appropriate  mapping  of  thcwe  structures  onto  processing 
elemoits.  Smee  the  data  objects  which  are  computed  in  par- 
aUel  are  generally  of  the  same  type,  a  single  program  control 
stream  is  sufficient,  Le.,  no  synehrooi^ation  problems  between 
concurrent  processes  need  to  be  solved  by  the  programmer. 
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Fifw%  7  Uluit?fttM  the  eompiii«tio&  for  tim  m@thod.  For  CAeh 
time  step  the  depth  teetioii  swdi  to  bt  eomputed.  This  eom- 
putation  iavt>W@8  nl^nnen  to  th@  d@pth  i@etioii  of  the  two 
previouj  time  steps,  the  row  ef  the  louice  data  for  the  eorre- 
ipondinf  time  step  uid  the  ^dMity  model  Figure  8  show*  the 
computatioa  of  individual  pid  points  of  the  depth  itetion.  The 
Figure  shows  a  gimpUAed  computation  of  a  second  order  &ute 
dUTerenee  scheme,  whereai  the  Mtual  implementation  is  fourth 
order.  As  one  can  see,  ^  pid  points  of  the  depth  section  for 
a  given  tme  point  can  bt  counted  m  panJlel.  The  following 
summatigts  m  data  pwaUtl  algorithm.  A  more  detailed  descrip- 
tion can  be  found  in  Referenee  13. 
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Figura  S:  Computntion  of  S«eond  Order  Dmefene^  Operator 


Figur*  7:  Rtver^  Tim*  Migration 


A  processing  element  is  assipied  to  each  grid  point  in  the  depth 
section.  The  procosing  element  holds  ako  the  data  for  tMs 
grid  point  which  corresponds  to  the  depth  sections  of  the  pre° 
vious  time  points  uid  of  the  velocity  model  The  source  data 
is  stored  hy  distributing  it  over  the  procasing  elements  of  the 
depth  section.  Fig.  9  sumimuija  the  mapping  of  data  objects 
to  processing  elements  in  this  problem. 
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FIgur®     9:  RsprsiMnUlleR  of  Dsplt)  SMiisn,  Veteelly 
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